Dating Texts without Explicit Temporal Cues 



Abhimanu Kumar 

School of Computer Science 
Carnegie Mellon University 
abhimank@cs.cmu.edu

Matthew Lease 

School of Information 
University of Texas at Austin
ml@ischool.utexas.edu

Abstract 

This paper tackles temporal resolution of doc- 
uments, such as determining when a docu- 
ment is about or when it was written, based 
only on its text. We apply techniques from 
information retrieval that predict dates via 
language models over a discretized timeline. 
Unlike most previous works, we rely solely 
on temporal cues implicit in the text. We
consider both document-likelihood and diver- 
gence based techniques and several smoothing 
methods for both of them. Our best model pre- 
dicts the mid-point of individuals' lives with 
a median of 22 and mean error of 36 years 
for Wikipedia biographies from 3800 B.C. to 
the present day. We also show that this ap- 
proach works well when training on such bi- 
ographies and predicting dates both for non- 
biographical Wikipedia pages about specific 
years (500 B.C. to 2010 A.D.) and for publication
dates of short stories (1798 to 2008). Together,
our work shows that, even in the absence
of temporal extraction resources, it is possible 
to achieve remarkable temporal locality across
a diverse set of texts. 

1 Introduction 

Temporal analysis of text has been an active area of 
research since the early days of text mining with 
different focus in different disciplines. In early 
computational linguistics research it was primarily 
concerned with the fine-grained ordering of tempo- 
ral events (Allen, 1983; Vilain, 1982). Informa- 
tion retrieval research has focused largely on time- 
sensitive document ranking (Dakka et al., 2008; Li 
and Croft, 2003), temporal organization of search
results (Alonso et al., 2009), and how queries and
documents change over time (Kulkarni et al., 2011).

This paper explores temporal analysis models that 
use ideas present in both computational linguistics 
and information retrieval. While some prior research 
has focused on extracting explicit mentions of tem- 
poral expressions (Alonso et al., 2009), we inves- 
tigate the feasibility of using text alone to assign 
timestamps to documents. Following previous document
dating work (de Jong et al., 2005; Kanhabua
and Nørvåg, 2008; Kumar et al., 2011), we construct
supervised language models that capture the tempo- 
ral distribution of words over chronons, which are 
contiguous atomic time spans used to discretize the
timeline. Each chronon model is smoothed by inter- 
polation with the entire training set collection. For 
each test document, a unigram language model is 
computed and used to find the document's similarity 
with each chronon's language model. This provides 
a ranking over chronons for the document, repre- 
senting the document's likelihood of being similar to 
the time periods covered by each chronon (de Jong 
et al., 2005; Kanhabua and Nørvåg, 2008).

Our chronon models are learned from Wikipedia 
biographies spanning 3800 B.C. to 2010 A.D. 
Wikipedia-based training is advantageous since its 
recency enables us to control against stylistic vs. 
content factors influencing vocabulary use (e.g. consider
the difference between William Mavor's 1796
discussion^1 of Sir Walter Raleigh vs. a modern
retrospective biography^2). This contrasts with resources
such as the Google n-grams corpus (Michel
et al., 2010), which is based on publication dates,
and thus reflects information about when a document
was written rather than what it is about.

^1 http://bit.ly/lKRSAa
^2 http://en.wikipedia.org/wiki/Walter_Raleigh

Our methods, all of which use the Wikipedia bi- 
ographies for training models, are evaluated on three 
tasks. The first is matched to the training data: pre- 
dict the mid-point of an individual's life based on 
the text in his or her Wikipedia biography. Our best 
model achieves a median error of 22 years and a 
mean error of 36 years. The second task is to pre- 
dict the year for a set of events between 500 B.C. 
and 2010 A.D., using Wikipedia's pages for events 
in each year.^3 The best model gives a mean error
of 36 years and median error of 21 years. The fi- 
nal task is predicting the publication dates of short 
stories from the Gutenberg project from the period 
1798 to 2008.^4 In comparison to biographies, these
stories have far fewer mentions of historical named
entities with peaked time signatures useful for
prediction. This, plus the difference in genre between
Wikipedia biographies (training) and works
of fiction, stands to make this task more challenging.
However, the distributions learned from the biogra- 
phies prove to be quite robust here: our best model 
achieves a mean error of 20 years and a median error 
of 17 years from the true publication date. 

Our primary contribution is demonstrating the ro- 
bustness and informativity of the implicit temporal 
cues available in text alone, across a diverse set of 
three prediction tasks. We do so for document col- 
lections spanning hundreds and thousands of years, 
whereas previous work has generally focused on rel- 
atively short periods (decades) for recent time spans. 
Note that we use a robust temporal expression identifier
for English, Heidel-Time (Strötgen and Gertz,
2010), to identify and remove dates from all texts for
both training and testing. While one could exploit a
resource such as Heidel-Time to perform rule-based
document dating (possibly in combination with our
methods and others such as (Chambers, 2012)), this
work demonstrates that text-based techniques can be
used effectively for languages for which such temporal
extraction resources are not available (Heidel-Time
has resources only for English, German and
Dutch).

^3 http://en.wikipedia.org/wiki/List_of_years
^4 http://www.gutenberg.org

A second contribution is a thorough exploration 
of the information retrieval approach for this task, 
including consideration of three different tech- 
niques for smoothing chronon language models and 
a comparison of generative (document-likelihood) 
and KL-divergence models for identifying the best 
chronon for a test document. We find that straightforward
Jelinek-Mercer smoothing (basic linear
interpolation) works the best, and that both document
likelihood and KL-divergence based approaches 
perform similarly. 

A specific task of interest in digital humanities 
is to identify and visualize text sequences relat- 
ing to the same time period across a collection of 
books. Our approach can be used to timestamp 
subsequences of documents, which could be book- 
length narratives or fictions, without explicit dates. 

2 Related Work 

Corpora for temporal evaluation. With increased
focus on temporal analysis, there have been efforts
to create richly annotated corpora to train and evaluate
temporal models, e.g. TimeBank (Pustejovsky
et al., 2003) and WikiWars (Mazur and Dale, 2010)
were created to provide a common set of corpora
for evaluating time-sensitive models. (2011) use the
above corpora to resolve geographic and temporal
references in text while (2008) use these to model
event structures.

Semantic based temporal models. Time- 
sensitive models have also been developed using se- 
mantic properties of data. (Grishman et al., 2002) 
use semantic properties of web-data to create and 
automatically update a database on infectious dis- 
ease outbreaks. Other simpler approaches have been 
explored to analyse literary and historical documents 
as well as recent datasets such as tweets and search 
queries. Time based analysis of historical texts pro- 
vides important information as to how significant 
events happened in the past on a temporal scale. The 
Google N-Grams viewer^5, which uses word counts
from millions of books and corresponding publication
dates, provides plots of n-gram word sequences
over a timeline (Michel et al., 2010). This gives useful
insights into historical trends of events/topics and
writing styles. Time based analysis of tweets has
gained popularity in recent years, especially to capture
current trending topics for tracking news items
and market sentiment (Zhang et al., 2010).

^5 http://ngrams.googlelabs.com/

Time aware latent models. Another approach 
for temporal text analysis is latent variable based 
graphical models. Dynamic Topic Models (Blei and 
Lafferty, 2006) are used to analyze the evolution 
of topics over time in a large document collection 
(Wang et al., 2008). (2006) analyse variations in
topic occurrences over a large corpus for a fixed
time period. (2008) investigate the history of ideas
in a research field through latent variable approaches.
(2007) use graphical models for temporal analysis of
blogs and Zhang et al. (2010) provide
clustering techniques for time varying text corpora 
through hierarchical Dirichlet processes for model- 
ing time sensitivity. 

Temporal analysis using conventional language 
models. Time based text analysis has been ex- 
plored using conventional language model based ap- 
proaches for various applications e.g. time-sensitive 
query interpretation (Li and Croft, 2003; Dakka 
et al., 2008), time-based presentation of search re- 
sults (Alonso et al., 2009), and modelling query and 
document changes over time (Kulkarni et al., 2011).
Li and Croft (2003), one of the early temporal language
models, use explicit document dates to estimate
a more informative document prior. More recently,
(2008) propose models for identifying important
time intervals likely to be of interest for a
query incorporating document publication date into 
the ranking function. (2009) use explicit tempo- 
ral metadata and expressions as attributes to cluster 
documents and create timelines for exploring search 
results. 

Document dating — the task of this paper — is a 
closely related problem. (2005) follow a language 
model based approach to assign dates to Dutch 
newspaper articles from 1999-2005 by partition- 
ing the timeline into discrete time periods. (2008) 
extend this work to incorporate temporal entropy 
and search statistics from Google Zeitgeist. These 
approaches (de Jong et al., 2005; Kanhabua and
Nørvåg, 2008) normalize the evidence for each
chronon by the whole collection. (2012) improve
over these by including linguistic constraints such
as named entity recognition (NER), POS tagging and
regular expression based temporal relation constraints
(e.g. "after", "before" etc.) and using a MaxEnt
classifier for training. (2012) use linguistic features
such as sentence length, context, entity list in a
document etc. to discover events on Twitter and assign
time stamps by framing it as a binary classification
problem with the two classes as relevant and
non-relevant. However, all these approaches were applied
to a small time range (6-10 years), whereas our datasets
span around 5000 years, over which the evidence would
die down after normalization. (2011) use divergence
based methods and non-standard
smoothing on Wikipedia biographies for
the same task. We perform our experiments on two 
of their datasets, Wikipedia biographies and Guten- 
berg short stories, and we compare their smoothing 
method with standard Jelinek-Mercer and Dirichlet 
smoothing. 

3 Document Collections 

Our models are trained and evaluated on three
datasets.^6

Wikipedia biographies (wiki-bio). The English
Wikipedia dump of September 4, 2010
is used^7 to obtain biographies of individuals who
lived between the years 3800 B.C. and 2010 A.D.
We extract the lifetime of each individual via each
article's Infobox birth_date and death_date
fields. We exclude biographies which specify
neither field or which fall outside the year range
considered. If the birth date is missing, we approximate
it as 100 years before the death date (similarly
and conversely when the death date is missing). We
perform this only to estimate the word distributions
in the training set. All such documents are discarded
for validation and test. We treat the life span of each
individual as the article's labeled time span. Note
that the distribution of biographies is quite skewed
toward recent times, as shown in Figure 1.
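The labeling rule described above can be sketched as follows. This is only an illustration under stated assumptions: years are encoded as signed integers (negative for B.C.), and the function names `labeled_span` and `usable_for_evaluation` are hypothetical, not taken from the authors' code.

```python
from typing import Optional, Tuple

# 100 years is the approximation stated in the text for a missing date.
ASSUMED_LIFESPAN = 100


def labeled_span(birth: Optional[int], death: Optional[int]) -> Optional[Tuple[int, int]]:
    """Return the (start, end) year span used for training, or None if no date is given.

    Years are signed integers (negative for B.C.); this encoding is an assumption.
    """
    if birth is None and death is None:
        return None  # biography is excluded entirely
    if birth is None:
        birth = death - ASSUMED_LIFESPAN
    if death is None:
        death = birth + ASSUMED_LIFESPAN
    return (birth, death)


def usable_for_evaluation(birth: Optional[int], death: Optional[int]) -> bool:
    """Only documents with both dates present are kept for validation and test."""
    return birth is not None and death is not None
```

For example, a biography with only a death year of 1865 would receive the training span (1765, 1865) but would not be used for development or testing.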

The resulting dataset contains a total of 280,867
Wikipedia biographies of individuals whose lifetimes
begin and end within the year range considered
(3800 B.C. to 2010 A.D.). These biographies
are randomly split into subsets for training, development,
and testing. We remove documents from the development
and test sets if either their birth_date
or death_date is missing. This leaves us with
224,476 training articles, 8,358 development articles
and 8,440 test articles.

^6 All three will be released upon publication, including
processing and extraction needed for replication of experiments.
^7 http://download.wikimedia.org/enwiki/20100904/enwiki-20100904-pages-articles.xml.bz2

Figure 1: Graph of number of births per year in the
Wikipedia biography training set.

Year         Sample Text
400 B.C.     The Carthaginians occupy Malta.
             War breaks out between Sparta and Elis.
             San Lorenzo Tenochtitlán is abandoned.
             Thucydides, Greek historian dies.
             The catapult is invented by Greek engineers.
50 A.D.      Claudius adopts Nero.
             Phaedrus, Roman fabulist dies.
             The Epistle to the Romans is written.
             Abgarus of Edessa, king of Osroene dies.
             Hero of Alexandria invents steam turbine.
1000 A.D.    Dhaka, Bangladesh, is founded.
             The Diocese of Kołobrzeg is founded.
             Garcia IV of Pamplona dies.
             Gunpowder is invented in China.
             Middle Horizon period ends in the Andes.
2000 A.D.    Tate Modern Gallery opens in London.
             Tuvalu joins the United Nations.
             The last Mini is produced in Longbridge.
             The Constitution of Finland rewritten.
             Patrick O'Brian, English writer dies.

Table 1: Sample text from four different years in the
wiki-year dataset.

Wiki-year pages (wiki-year). Wikipedia has a
collection of pages corresponding to various years
that describe the events that occurred in a given
year.^8 Each page has the corresponding year as
its label and the text contains all the events that occurred
in that year; some examples are shown in
Table 1. Pages for years before 500 B.C. at times
contain events that span several years, so we restrict
the documents to be those from 500 B.C. to 2010
A.D.^9 The 2,511 documents for this span are divided
into even years for development (1256 documents)
and odd years for testing (1255 documents).

Table 1 shows random sample lines from four
wiki-year pages. The lines are terse and the text as a
whole contains very few temporal expressions.

Gutenberg short stories (gutss). We collected
678 English short stories published between 1798
and 2008, obtained from the Gutenberg Project.
Whereas with Wikipedia biographies we use labeled 
time spans corresponding to lifetimes, Gutenberg 
stories are labeled by publication year. The average, 
minimum and maximum word count of these stories 
are (roughly) 14,000, 11,000 and 100,000 respec- 
tively. Stories are randomly split into a development 
and test set of 333 and 345 documents, respectively. 

Notation. We refer to biographies, stories and
Wiki-Year pages alike as documents, and each
dataset as defining a document collection c consisting
of N documents: $c = d_{1:N}$.

4 Model 

Similar to previous work, we represent continuous 
time via discrete units. Our formalization most 
closely follows that of Alonso et al. (2009). The
smallest temporal granularity we consider in this
work is a single year.

^8 http://en.wikipedia.org/wiki/List_of_years
^9 For example the events "Proto-Greek invasions of Greece.",
"Minoan Old Palace (Protopalatial) period starts in Crete." etc.
are present in the text for 1878 as well as 1880 B.C. These occurred
around 1880 B.C. but their exact occurrence date is unknown.



4.1 Estimation

Let a span of multiple, contiguous years be some interval
$\tau = [y_s, y_e]$, where $y_s$ and $y_e$ refer to start
and end years, respectively. As noted in §3, we also
know the year range covered by each document collection
and restrict our overall timeline correspondingly
to the span $\tau_0 = [y_0, y_T)$, covering a total of
$y_T - y_0 = T$ years.

A chronon is an atomic interval x upon which a
discrete timeline is constructed (Alonso et al., 2009).
In this paper, a chronon consists of $\delta$ years, where $\delta$
is a tunable parameter. Given $\delta$, the timeline is
decomposed into a sequence of n contiguous, non-overlapping
chronons $x = x_{1:n}$, where $n = T/\delta$.

A "pseudo-document" $d^x$ is created for each
chronon x as the concatenation of all training documents
whose labeled span overlaps x. For example,
for a chronon size $\delta = 25$ years, the biography
of Abraham Lincoln (1809-1865) is included in
pseudo-documents for each of the chronons representing
1800-1825, 1826-1850, and 1851-1875.

A chronon model $\Theta^x$ is estimated from the
pseudo-document $d^x$ and smoothed via interpolation
with the collection. Chronon models are smoothed
in three ways: a) Jelinek-Mercer smoothing (JM)
(Zhai and Lafferty, 2004), b) Dirichlet smoothing
(Zhai and Lafferty, 2004), and c) chronon-specific
smoothing (CS) (Kumar et al., 2011). For all three,
for each word w, $\Theta^x_w$ can be computed as a mixture
of document d and document collection c maximum-likelihood
(ML) estimates:

$$\Theta^{d}_{w} = \lambda \frac{f^{d}_{w}}{|d|} + (1-\lambda) \frac{f^{c}_{w}}{|c|}, \qquad (1)$$

where $f^d_w$ and $f^c_w$ denote the frequency of word w in
the document or collection respectively, $|d|$ and $|c|$
are the document and collection lengths, and the parameter
$\lambda$ specifies the smoothing strength. In case
of Jelinek-Mercer smoothing, the value of $\lambda$ is chosen
directly via tuning over values from zero to one.
With Dirichlet smoothing, $\lambda$ is chosen as:

$$\lambda = \frac{|d|}{|d| + \mu}, \qquad (2)$$

where $\mu$ is a hyper-parameter tuned on the development set.

Chronon-specific smoothing, in turn, is a special
case of Dirichlet smoothing in which $\mu$ is set by
Equation (3) (the body of this equation is not
recoverable from this copy of the text),
where $|V_{d^x} \cup V_{d_i}|$ denotes the document-chronon specific
vocabulary for some collection document $d_i$
and pseudo-document $d^x$, and $\zeta$ is a prior for hyper-parameter
$\mu$ that is tuned on the development set.
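As a rough illustration of the interpolation in Equation (1), the sketch below computes a smoothed word probability for a toy pseudo-document. The Dirichlet weight uses the standard form $|d| / (|d| + \mu)$ from Zhai and Lafferty (2004); the function names and the toy counts are assumptions made for this example, and chronon-specific smoothing is not shown.

```python
from collections import Counter


def smoothed_prob(word, doc_counts, coll_counts, lam):
    """Equation (1): lam * f_w^d / |d| + (1 - lam) * f_w^c / |c|."""
    d_len = sum(doc_counts.values())
    c_len = sum(coll_counts.values())
    p_doc = doc_counts[word] / d_len if d_len else 0.0
    p_coll = coll_counts[word] / c_len
    return lam * p_doc + (1.0 - lam) * p_coll


def dirichlet_lambda(d_len, mu):
    """Standard Dirichlet-prior mixing weight: |d| / (|d| + mu)."""
    return d_len / (d_len + mu)


# Toy data (hypothetical): one chronon pseudo-document and the whole collection.
pseudo_doc = Counter("emperor han dynasty emperor".split())
collection = Counter("emperor han dynasty emperor senate rome senate consul".split())

# Jelinek-Mercer: lambda is tuned directly.
p_jm = smoothed_prob("senate", pseudo_doc, collection, lam=0.99)

# Dirichlet: lambda depends on the pseudo-document length and mu.
lam_dir = dirichlet_lambda(sum(pseudo_doc.values()), mu=4.0)
p_dir = smoothed_prob("emperor", pseudo_doc, collection, lam_dir)
```

Note that a word absent from the pseudo-document ("senate" here) still receives non-zero probability from the collection term, which is what makes the later likelihood and divergence computations well defined.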

4.2 Estimation 

We calculate the affinity between each chronon x
and a document d by estimating the discrete distribution
$P(x|d)$. In the next section, we use $P(x|d)$
to infer affinity between d and different chronons.
The mid-point of the most likely chronon (see Section 4.1)
is then returned as the predicted year by the
model. We define two primary models for estimating
$P(x|d)$. The first approach estimates the likelihood
of d for each chronon; via Bayes rule, this is
combined with a chronon prior to calculate the likelihood
of each chronon for d. The second approach
ranks chronons based on the divergence between latent
unigram distributions $P(w|d)$ and $P(w|x)$ (Lafferty
and Zhai, 2001a).

Ranking by document likelihood The language
modeling approach for information retrieval was
originally formulated as query-likelihood (Ponte and
Croft, 1998). For our task, the document is the
"query" for which we wish to rank chronons. We
refer to this approach as document-likelihood (DL).

We estimate $P(x|d) \propto P(d|x)P(x)$ via Bayes
Rule. Assuming unigram modeling, the likelihood
of a test document d is given by:

$$P(d|x) = \prod_{w \in d} \Theta^{x}_{w} \qquad (4)$$

where the parameters of $\Theta^x$ are estimated from the
chronon x's pseudo-document $d^x$, as described in
Section 4.1.
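A minimal sketch of document-likelihood scoring under the unigram assumption is given below. It works in log space to avoid numerical underflow on long documents, which is an implementation choice of this example rather than something stated in the paper; the chronon names and probabilities are made up, and the models are assumed to be already smoothed.

```python
import math


def log_likelihood(doc_tokens, chronon_model):
    """log P(d|x): sum over the document's tokens of log Theta^x_w."""
    return sum(math.log(chronon_model[w]) for w in doc_tokens)


# Toy smoothed chronon models (hypothetical values; each sums to one).
models = {
    "x1": {"emperor": 0.6, "railway": 0.1, "senate": 0.3},
    "x2": {"emperor": 0.1, "railway": 0.7, "senate": 0.2},
}
doc = ["railway", "railway", "senate"]

scores = {x: log_likelihood(doc, m) for x, m in models.items()}
best = max(scores, key=scores.get)
```

With a non-uniform chronon prior, the term log P(x) would simply be added to each chronon's score before taking the maximum.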

Just as informed document priors (e.g. PageR- 
ank or document length) inform traditional docu- 
ment ranking in information retrieval, an informed 
prior over chronons has potential to benefit our task 
as well. We adopt a chronon prior intuitively informed
by the distribution of training documents
over chronons:

$$P(x) = \frac{|d_{train} \in d^{x}|}{\sum_{x'} |d_{train} \in d^{x'}|} \qquad (5)$$

where $d_{train}$ is a training document, $d^x$ is the
pseudo-document for chronon x and $|d_{train} \in d^x|$
is the number of dated training documents overlapping
with chronon x.
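The prior can be sketched as a normalized count of overlapping training documents. Normalizing by the sum of counts over all chronons is an assumption of this example, as is the representation of chronons as half-open `(start, end)` tuples; the spans are toy data.

```python
from collections import Counter


def chronon_prior(doc_spans, chronons):
    """P(x) proportional to the number of training spans overlapping chronon x."""
    counts = Counter()
    for start, end in doc_spans:
        for cs, ce in chronons:
            # Closed span [start, end] overlaps the half-open chronon [cs, ce).
            if start < ce and end >= cs:
                counts[(cs, ce)] += 1
    total = sum(counts.values())
    return {x: counts[x] / total for x in chronons}


chronons = [(1800, 1825), (1825, 1850), (1850, 1875)]
spans = [(1809, 1865), (1840, 1860), (1855, 1870)]  # toy life spans
prior = chronon_prior(spans, chronons)
```

Because one document can overlap several chronons, the counts sum to more than the number of documents, so normalization is needed for the values to form a distribution.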

Ranking by model comparison Zhai and Laf- 
ferty (2001b) propose ranking via KL-divergence 
between a query and each collection document. Kumar
et al. (2011) use this approach to compute
$P(x|d)$, which is estimated by computing the inverse
KL-divergence of x and d and normalizing this
value with the sum of inverse divergences with all
chronons:

$$P(x|d) = \frac{D(\Theta^{d} \| \Theta^{x})^{-1}}{\sum_{x'} D(\Theta^{d} \| \Theta^{x'})^{-1}} \qquad (6)$$

It is straightforward to see that their formulation is
rank equivalent to standard model comparison ranking
with negative KL-divergence (de Jong et al.,
2005; Kanhabua and Nørvåg, 2008):

$$P(x|d) \propto -D(\Theta^{d} \| \Theta^{x}) \qquad (7)$$

Lafferty and Zhai showed such ranking is equivalent
to generating the query (i.e. query-likelihood)
assuming a uniform document prior and the query
model being estimated by relative frequency (Lafferty
and Zhai, 2001b). This means that for our
task, if we adopt a uniform prior over chronons
and estimate the document model by relative frequency,
then KL-ranking and document-likelihood
approaches will be rank equivalent.
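The rank equivalence can be checked on a toy example. The sketch below, with hypothetical chronon models and function names, ranks chronons by KL-divergence from a relative-frequency document model, by normalized inverse divergence, and by document likelihood with a uniform prior.

```python
import math
from collections import Counter


def relative_freq(tokens):
    """Unsmoothed document model estimated by relative frequency."""
    n = len(tokens)
    return {w: c / n for w, c in Counter(tokens).items()}


def kl_divergence(doc_model, chronon_model):
    """D(Theta^d || Theta^x); the chronon model must be smoothed (no zeros)."""
    return sum(p * math.log(p / chronon_model[w]) for w, p in doc_model.items())


def rank_by_kl(doc_tokens, models):
    doc_model = relative_freq(doc_tokens)
    return sorted(models, key=lambda x: kl_divergence(doc_model, models[x]))


def rank_by_likelihood(doc_tokens, models):
    def loglik(m):
        return sum(math.log(m[w]) for w in doc_tokens)
    return sorted(models, key=lambda x: -loglik(models[x]))


def inverse_kl_scores(doc_tokens, models):
    """Normalized inverse divergences; undefined if some divergence is exactly zero."""
    doc_model = relative_freq(doc_tokens)
    inv = {x: 1.0 / kl_divergence(doc_model, m) for x, m in models.items()}
    z = sum(inv.values())
    return {x: v / z for x, v in inv.items()}


models = {
    "x1": {"emperor": 0.6, "railway": 0.1, "senate": 0.3},
    "x2": {"emperor": 0.1, "railway": 0.7, "senate": 0.2},
    "x3": {"emperor": 0.3, "railway": 0.3, "senate": 0.4},
}
doc = ["railway", "railway", "senate"]

kl_ranking = rank_by_kl(doc, models)
dl_ranking = rank_by_likelihood(doc, models)
inv_scores = inverse_kl_scores(doc, models)
```

The two rankings agree because the divergence equals the negative document entropy minus the per-token average log-likelihood, and the entropy term does not depend on the chronon.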

Prediction Having determined $P(x|d)$, we choose
the midpoint $\hat{y}$ of the most likely chronon; for a
chronon $x = [y, y+\delta]$, the mid-point is $\hat{y} = y + \delta/2$.
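The timeline decomposition and the mid-point rule can be sketched together. In this example chronons are half-open intervals `[start, start + delta)`, which is an assumption (the Lincoln example in Section 4.1 lists inclusive year ranges); all function names are hypothetical.

```python
def build_chronons(y0, y_end, delta):
    """Split [y0, y_end) into contiguous, non-overlapping chronons of delta years."""
    return [(start, min(start + delta, y_end)) for start in range(y0, y_end, delta)]


def overlapping(span, chronons):
    """Chronons whose pseudo-document would include a document labeled with span."""
    start, end = span
    return [(cs, ce) for cs, ce in chronons if start < ce and end >= cs]


def midpoint(chronon):
    """Predicted year for a chronon [y, y + delta): y + delta / 2."""
    cs, ce = chronon
    return cs + (ce - cs) / 2


chronons = build_chronons(1800, 1900, 25)
lincoln = overlapping((1809, 1865), chronons)  # Abraham Lincoln's life span
predicted_year = midpoint((1850, 1875))
```

With a 25-year chronon size, the Lincoln biography contributes to three pseudo-documents, matching the example given in Section 4.1.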

5 Experimental Setup 

Data. To test the ability of word-based models
to predict timestamps for documents, all temporal
expressions identified in each document using the
Heidel-Time temporal tagger (Strötgen and Gertz,
2010) are removed. All numeric tokens and standard
stopwords are also removed. The remaining tokens
produce a vocabulary of 374,973 words for the entire
Wikipedia biography corpus. Heidel-Time also
provides the first two dates present in the text, which
we use as a strong baseline for the biography task.

Tuning and smoothing For each model+task, we
tune the parameters $\delta$, $\mu$, $\zeta$, and $\lambda$ over the development
sets of the corresponding dataset. As in
prior work (de Jong et al., 2005; Kanhabua and
Nørvåg, 2008; Kumar et al., 2011), we smooth
chronon pseudo-document language models (for all
models as well as smoothing techniques) but not
document models. While smoothing both may potentially
help, smoothing the former is strictly necessary
for KL-divergence to prevent division by zero.

Target predictions For Wikipedia biographies,
the predicted $\hat{y}$ represents the mid-point of the individual's
life span; for wiki-years, it is the year of
the events on the page, and for Gutenberg short stories
it is the publication date of the story. In later
sections we will present the baseline predictions for
$\hat{y}$ for each dataset.

Error Measurement When predicting a single
year for a document, a natural error measure between
the predicted year $\hat{y}$ (mid-point) and the actual
year $y^*$ is the difference $|\hat{y} - y^*|$. We compute
this difference for each document, then compute and
report the mean and median of the differences across
documents. Similar distance error measures have
also been used with document geolocation (Eisen- 
stein et al., 2010; Wing and Baldridge, 2011). 
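The error measure amounts to the following computation; the predicted and actual years are toy values chosen for illustration.

```python
from statistics import mean, median


def absolute_errors(predicted, actual):
    """Per-document absolute difference between predicted and true year."""
    return [abs(p - a) for p, a in zip(predicted, actual)]


errs = absolute_errors([1850, 1905, 120], [1837, 1905, 180])
mean_error = mean(errs)
median_error = median(errs)
```

Reporting both statistics is informative because a few badly misplaced documents can inflate the mean while leaving the median almost unchanged.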

Baselines For Wikipedia biographies the first
baseline (baseline-ht) is the mid-point of the
first two dates extracted by Heidel-Time
(Strötgen and Gertz, 2010). This is a highly
effective baseline since it is often the case in
Wikipedia biographies that the first two dates are
the birth and death dates. The second baseline for
biographies is to always predict the year that has
the greatest number of biographies spanning it, which
is 1915 (baseline-1915). For Gutenberg stories, we
take 1903, the midpoint of the range of publication
dates (1798-2008), as the baseline (baseline-1903).
For wiki-years, the baseline is the midpoint of the
prediction range, i.e. $(-500 + 2010)/2 = 755$ (baseline-755).
This assumes that one knows a rough range of
possible publication dates, which is reasonable for
many applications and thus provides a good reference
for comparison.

We also report oracle error which is the mean 
and median error which would occur if a model 
always picked the correct chronon. This error 
arises because chronons span multiple years; large 
chronons in particular will have higher oracle error 
(but may perform better for actual prediction due to 
better model estimation). 
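Oracle error can be illustrated by predicting the mid-point of the chronon that actually contains the true year. The sketch assumes half-open chronons aligned at a start year `y0` and true years spread evenly over one 10-year chronon; both are assumptions of this example.

```python
def oracle_prediction(true_year, y0, delta):
    """Mid-point of the chronon [start, start + delta) that contains true_year."""
    start = y0 + ((true_year - y0) // delta) * delta
    return start + delta / 2


true_years = list(range(1800, 1810))  # one year per position in a 10-year chronon
errors = [abs(oracle_prediction(y, 1800, 10) - y) for y in true_years]
mean_oracle_error = sum(errors) / len(errors)
```

Under these assumptions the mean oracle error for 10-year chronons works out to a quarter of the chronon size, which shows why larger chronons necessarily carry a larger oracle error.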




Figure 2: Tuning for $\delta$ over wiki-bio and wiki-years
datasets for KL model. $\zeta$ (for CS) and $\lambda$ (for JM) are
fixed at 0.01 and 0.99 respectively.



6 Results 

6.1 Parameter tuning 

We begin with year prediction experiments on the
development sets to tune the parameters $\delta$, $\zeta$ or
$\mu$. We parametrize $\mu$ as a function of the average
chronon size in the training set:

$$\mu = |\rho c| \qquad (8)$$

c is a constant whose value is dependent upon the
model and the task. The value of $\rho$ is tuned over the
validation set.
Figure 3

Figure 4: Tuning for smoothing parameters ($\zeta$ and
$\lambda$) over wiki-bio and wiki-years datasets for KL
model.



Choice of chronon size and smoothing parameters.
We tune the chronon size ($\delta$) over the validation
set and tune the smoothing parameters $\lambda$, $\rho$, and
$\zeta$ (depending on the type of smoothing) for the best $\delta$
obtained. For $\delta$ tuning we assign an arbitrary value
to the smoothing parameter $\lambda$. $\delta$ is tuned for
each dataset and for the KL model with CS and JM smoothing.
The DL model with Dirichlet/JM smoothing and the
KL model with Dirichlet smoothing use the same
best $\delta$ obtained for the KL model with JM smoothing
on the respective datasets. For each dataset, model
and smoothing triad, the smoothing parameter $\lambda$, $\zeta$,
or $\rho$ is tuned. Tuning is performed to minimize the
mean error on the development sets. The search
space for smoothing parameters $\zeta$, $\lambda$ and $\rho$ includes
{1e-12, 1e-11, ..., 0.1, 0.25, 0.75, 0.9, 0.99,
..., 0.999999999}.
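One way to generate a search grid of this shape is sketched below. The values hidden behind the ellipses are assumed to be successive powers of ten at the low end and successive "nines" at the high end; that reading, the function name, and the toy error function are all assumptions of this example.

```python
def smoothing_grid():
    """Candidate smoothing values: 1e-12 ... 0.1, then 0.25, 0.75, then 0.9 ... 0.999999999."""
    low = [10.0 ** -k for k in range(12, 0, -1)]     # 1e-12, 1e-11, ..., 0.1
    mid = [0.25, 0.75]
    high = [1.0 - 10.0 ** -k for k in range(1, 10)]  # 0.9, 0.99, ..., 0.999999999
    return low + mid + high


def toy_dev_error(value):
    """Stand-in for the mean development-set error (hypothetical)."""
    return abs(value - 0.99)


grid = smoothing_grid()
best = min(grid, key=toy_dev_error)
```

In an actual tuning run, `toy_dev_error` would be replaced by a function that trains the chronon models with the given value and returns the mean error on the development set.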

Figures 2 and 4 show the tuning of $\delta$ and smoothing
parameters ($\lambda$ for JM and $\zeta$ for CS) for the
wiki-bio and wiki-years datasets. All triplets formed
by KL/DL model x JM/Dirichlet smoothing x
wiki-years/wiki-bio/gutss dataset use the optimum
chronon size obtained for the respective datasets
from the KL model with JM smoothing.

In Figure 4 the mean error curve is generally
smooth for $\lambda$ and $\zeta$, unlike the curve for the chronon-size
parameter $\delta$ (Figure 2). This makes smoothing the LMs
robust to a range of values. $\delta$ shows more fluctuation
even in the optimal neighborhood, which makes
tuning chronon size more critical. A straightforward
strategy to reduce this sensitivity is to smooth
chronon models based on the word distributions of
neighboring chronons as well as interpolating with
the collection model, which we intend to explore
in future work. The optimal chronon sizes for the three
datasets are 10 years for wiki-bio and gutss and 50
years for wiki-year.

6.2 Test results. 

Table 2 shows the results for the various models on 
the test sets for all three datasets, using the parame- 
ters tuned on the corresponding development sets. 

Wiki-bio The models beat both baselines easily. 
Note that baseline-ht is quite strong for a large num- 
ber of documents: it gives a median error of zero 
since over half of the documents have birth and 
death dates as their first dates. Nonetheless, it fails 
entirely for many documents, and obviously has limited
applicability. The models all reduce error by
one half in comparison to baseline-1915. The best
model (DL + JM smoothing) achieves a mean error
of 37.4 years, which is quite strong given that the
prediction range is 5810 years. The mean oracle error
for the best model is 2.5 years. The mean and
median error was 36.6 and 22.0 years for the best
performing model (DL + JM smoothing) on the development
set.

Wiki-years The models beat baseline-755 com- 
fortably. Despite the fact that the documents are rel- 
atively short and that any given document contains 
a number of often unrelated events (and thus low 
counts per word type), the results are in line with 
those for wiki-bio, with mean error of 37.9 and me- 
dian error of 20 years for the best models. The mean 
oracle error, 12.4 years, for this dataset is higher due 
to the larger chronon size. The KL model with JM
smoothing provided the best mean and median error
of 36.7 and 21.0 years respectively on the development
set.



Gutss All models except the one that uses
chronon-specific smoothing with KL-divergence
outperform baseline-1903 on mean error, and even
that one is better on median. Since these are works
of fiction with few historical entities mentioned, the
mean error of 22.9 and median error of 19.0 of the
best models indicate that the approach is quite capable
of exploiting implicit temporal cues of basic
vocabulary choices. Also, recall that the model
is trained on Wikipedia; this demonstrates that this
choice of training set works well as the basis for predictions
on other domains. The mean oracle error
(for chronons of 10 years) is 2.5 years. For the development
set, the mean and median error was 20.4
and 17.0 years for the best performing model (DL +
JM smoothing).
Table 2: Test set results. $\lambda$=JM, $\zeta$=CS, and
$\rho$=Dirichlet smoothing. DL uses the non-uniform
chronon prior. The best results are bolded, and the
results of the best model on the corresponding development
set are italicized.

6.3 Output analysis 

Using the output on the development set, we find
interesting patterns in the predictions made by the
models and the way they use the words as evidence.

Time warps (2010) used geotags on Flickr images
to identify wormholes: locations that are not physically
near but which are nonetheless similar to one
another. We observe some similar patterns, in our
case time warps, in our dataset. These are particularly
prominent in wiki-year documents due to their
terseness, as these are lists of events that happened in
a given year. In addition, training the models on the
wiki-bio set adds to this phenomenon, as the contexts
of the two datasets are slightly different. A cluster of dev event
years between 150 and 250 A.D. (e.g. wiki-years
234, 214, 152, 156 etc.) are predicted to be in the 2nd
century B.C. (200 B.C. to 150 B.C.) by our model.
These event years are very short with an average
length of 40-50 words per document. The discriminatory
tokens present in these texts include: Roman,
Empire, Kingdom, Han, Dynasty, China, Seleucid,
Greek, etc. In the 200-150 B.C. period, all the documents
in the training set are about Greek/Seleucid, Roman
and Chinese (mostly from the Han dynasty) emperors/personalities
(e.g. Attalus I, Eratosthenes, Plautus,
Emperor Gaozu of Han, Emperor Hui of Han,
Zhang Qian, Emperor Wen of Han etc.) and contain
similar prominent terms as the wiki-year event
texts. This common collection of terms pushes the
model to resolve wiki-year texts to the 2nd century B.C.
This happens because of the relative frequency of
such terms in B.C. and A.D.: although these terms
are present in the A.D. chronons, their proportion
with respect to other terms is much smaller. Test
documents that contain these terms are thus attracted
to the B.C. chronons since they have these terms in
generally higher proportion.

Another interesting cluster is short documents
containing similar terms from 200-800 A.D. that are
resolved to the mid-6th century A.D. The short wiki-year
texts (e.g. years 246, 486, 750, 822, etc.) contain
a co-occurring set of terms like Byzantine, Empire,
Roman, Arab, Conquest, Islam, and Caliphate.
These short year-event texts mostly contain events related
to Byzantine wars, emperors, Islamic/Arab
conquests, Caliphates etc. These are resolved to
the mid-6th century A.D. period that predominantly
contains biographies of Islamic Caliphs (e.g. Abd
al-Malik, Abu Bakr, Ali, Umar etc.) and Byzantine
emperors and prominent personalities (e.g. Maurice,
Fausta, Constans II etc.), which have predominant
terms such as: Byzantine, Empire, Caliph, Islam,
and conquest.

Discriminative Words Tables 3 and 4 show the
top and bottom 25 words in descending order of
their strengths, where the predictive strength score
of a word w is calculated as the average prediction error
of all the documents that contain the word w. The
majority of the words that are most predictive are
uncommon nouns, especially uncommon last names
or famous titles, e.g. capote, komatsu, and cranmer.
Words such as tele, wavelength, electorates,
teleplay, sap (the company) also have a strong temporal
connection as these were not used before the
19th century. The least predictive ones are mostly
common words such as goodness, oneself, morality,
tub, crates, and lantern. The uncommon words
among the least predictive are generally present in
just one or two documents for which our model performs
very poorly. It is likely that these
words induce those warps due to their
predominance and uniqueness.
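The predictive strength score described above can be sketched as a per-word average of document errors. The documents and error values below are toy data, and counting each document at most once per word is an assumption of this example.

```python
from collections import defaultdict


def word_scores(docs, errors):
    """Average prediction error over all documents that contain each word."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, err in zip(docs, errors):
        for w in set(tokens):  # each document counts once per word
            totals[w] += err
            counts[w] += 1
    return {w: totals[w] / counts[w] for w in totals}


docs = [["capote", "novel"], ["lantern", "novel"], ["lantern", "tub"]]
errors = [2.0, 40.0, 90.0]  # absolute prediction error per document, in years

scores = word_scores(docs, errors)
most_to_least_predictive = sorted(scores, key=scores.get)
```

A low score marks a word as strongly predictive; as noted above, a word that appears in only one or two badly predicted documents will receive a high score regardless of how common it is elsewhere.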

7 Conclusion 

Using words alone, it is possible to identify the time 
period that a document is about (via the Wikipedia 
datasets) or the time period in which it was writ- 
ten (via the Gutenberg dataset). In the former case, 
the presence of named entities dominates the texts, 
and their names provide strong evidence for partic- 
ular historical periods. For the latter, the texts are 
fictional (including science fiction), and they rarely 
mention historical entities. For these, general terms 
that are indicative of a given time period dominate 
the prediction. Interestingly, the models that are 
used (successfully) for this latter task are trained on
Wikipedia biographies about historical individuals, 
but which were written in the last decade. 

The predictions made by our models provide 
a natural counterpart to other temporally sensitive 
models of word choice, such as Dynamic Topic 
Models (DTMs) (Blei and Lafferty, 2006). DTMs 
assume that documents are labeled with dates; our 
model could thus be used to create labels for an oth- 
erwise un-dated set of documents which can then 
be analyzed with DTMs. An important aspect of 
our work is that it opens opportunities for analyzing 
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Table 3: Top 25 most predicitve words in descending order (left to right and top to bottom) from wiki-bio 
dev set. 
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Table 4: Bottom 25 least predicitve words in descending order (left to right and top to bottom) from wiki-bio 
dev set. 



sub-parts of documents, such as chapters, sections 
and paragraphs of books. Consider, for example, 
Samuel Goodrich's "The Second Book of History" 
from 1840, which covers thousands of years of his- 
tory for many parts of the world. 

Of course, many texts include explicit dates, 
and exploiting their presence via approaches such 
as (2012) would only strengthen our predictions. 
Also, they create opportunities for using weaker, but 
more pin-pointed, supervision: strings identified as 
dates with high-confidence can be pivots for learning 
word distributions. This would obviate the need for 
labeled training material such as Wikipedia biogra- 
phies, and thereby enable our methods to be used 
and adapted for a wide variety of genres. Given 
decent temporal expression identifiers for other lan- 
guages, this could be used to bootstrap models for 
more languages as well. 